36 research outputs found
A novel dimensionality reduction technique based on independent component analysis for modeling microarray gene expression data
DNA microarray experiments, which generate thousands of gene expression measurements, are being used to gather information from tissue and cell samples regarding gene expression differences that will be useful in diagnosing disease. One challenge of microarray studies is that the number n of samples collected is small relative to the number p of genes per sample, which is usually in the thousands. In statistical terms, this very large number of predictors compared to a small number of samples or observations makes the classification problem difficult. This is known as the "curse of dimensionality" problem. An efficient way to address this problem is to use dimensionality reduction techniques. Principal Component Analysis (PCA) is a leading method for dimensionality reduction of gene expression data and is optimal in the sense of least-squares error. In this paper we propose a new dimensionality reduction technique for specific bioinformatics applications based on Independent Component Analysis (ICA). Because it can exploit higher-order statistics to identify a linear model, this ICA-based dimensionality reduction technique outperforms PCA in terms of both statistical and biological significance. We present experiments on the NCI 60 dataset to show this result
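As a rough illustration of the PCA baseline that the abstract compares against (not the authors' ICA method), the following sketch reduces a matrix with few samples and many genes to a handful of components using a thin SVD. The function name `pca_reduce`, the random data, and the matrix sizes are assumptions made only for this example.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X (samples x genes) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)  # center each gene (column)
    # Thin SVD: with n samples and p genes (n << p), only min(n, p) components exist.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                    # sample coordinates in the reduced space
    explained = (s[:k] ** 2) / np.sum(s ** 2)    # share of total variance per component
    return scores, explained

# Hypothetical data: 60 samples, 2000 genes (pure noise, for shape illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))
scores, explained = pca_reduce(X, k=5)
print(scores.shape)  # (60, 5)
```

PCA keeps the directions of largest variance, which is what "optimal in the sense of least-squares error" refers to; an ICA-based reduction would instead look for statistically independent components using higher-order statistics.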
A factor analysis model for functional genomics
BACKGROUND: Expression array data are used to predict biological functions of uncharacterized genes by comparing their expression profiles to those of characterized genes. While biologically plausible, this is both statistically and computationally challenging. Typical approaches are computationally expensive and ignore correlations among expression profiles and functional categories. RESULTS: We propose a factor analysis model (FAM) for functional genomics and give a two-step algorithm, using genome-wide expression data for yeast and a subset of Gene-Ontology Biological Process functional annotations. We show that the predictive performance of our method is comparable to the current best approach while our total computation time was faster by a factor of 4000. We discuss the unique challenges in performance evaluation of algorithms used for genome-wide functional genomics. Finally, we discuss extensions to our method that can incorporate the inherent correlation structure of the functional categories to further improve predictive performance. CONCLUSION: Our factor analysis model is a computationally efficient technique for functional genomics and provides a clear and unified statistical framework with potential for incorporating important gene ontology information to improve predictions
CFH and ARMS2 genetic risk determines progression to neovascular age-related macular degeneration after antioxidant and zinc supplementation
We evaluated the influence of an antioxidant and zinc nutritional supplement [the Age-Related Eye Disease Study (AREDS) formulation] on delaying or preventing progression to neovascular AMD (NV) in persons with age-related macular degeneration (AMD). AREDS subjects (n = 802) with category 3 or 4 AMD at baseline who had been treated with placebo or the AREDS formulation were evaluated for differences in the risk of progression to NV as a function of complement factor H (CFH) and age-related maculopathy susceptibility 2 (ARMS2) genotype groups. We used published genetic grouping: a two-SNP haplotype risk-calling algorithm to assess CFH, and either the single SNP rs10490924 or 372_815del443ins54 to mark ARMS2 risk. Progression risk was determined using the Cox proportional hazard model. Genetics-treatment interaction on NV risk was assessed using a multiiterative bootstrap validation analysis. We identified strong interaction of genetics with AREDS formulation treatment on the development of NV. Individuals with high CFH and no ARMS2 risk alleles and taking the AREDS formulation had increased progression to NV compared with placebo. Those with low CFH risk and high ARMS2 risk had decreased progression risk. Analysis of CFH and ARMS2 genotype groups from a validation dataset reinforces this conclusion. Bootstrapping analysis confirms the presence of a genetics-treatment interaction and suggests that individual treatment response to the AREDS formulation is largely determined by genetics. The AREDS formulation modifies the risk of progression to NV based on individual genetics. Its use should be based on patient-specific genotype
The utility and predictive value of combinations of low penetrance genes for screening and risk prediction of colorectal cancer
Although colorectal cancer (CRC) is a highly treatable form of cancer if detected early, a very low proportion of the eligible population undergoes screening for it. Integrating a genomic screening profile as a component of existing screening programs for CRC could potentially improve the effectiveness of population screening by allowing the assignment of individuals to different types and intensities of screening and also by potentially increasing the uptake of existing screening programs. We evaluated the utility and predictive value of genomic profiling as applied to CRC, and as a potential component of a population-based cancer screening program. We generated simulated data representing a typical North American population including a variety of genetic profiles, with a range of relative risks and prevalences for individual risk genes. We then used these data to estimate parameters characterizing the predictive value of a logistic regression model built on genetic markers for CRC. Meta-analyses of genetic associations with CRC were used in building science to inform the simulation work, and to select genetic variants to include in logistic regression model-building using data from the ARCTIC study in Ontario, which included 1,200 CRC cases and a similar number of cancer-free population-based controls. Our simulations demonstrate that, under reasonable assumptions involving modest relative risks for individual genetic variants, substantial predictive power can be achieved when risk variants are common (e.g., prevalence > 20%) and data for enough risk variants are available (e.g., ~140-160). Pilot work in population data shows modest but statistically significant predictive utility for a small collection of risk variants, smaller in effect than age and gender alone in predicting an individual's CRC risk.
Further genotyping, many more samples, and indeed the discovery of many more risk loci associated with CRC will be required before the question of the potential utility of germline genomic profiling can be definitively answered
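The relationship described above, in which predictive power grows with the number of common, modest-effect risk variants, can be sketched with a toy simulation. This is not the study's simulation design: the function name `simulate_auc`, the carrier prevalence of 25%, the per-variant odds ratio of 1.2, the 5% baseline risk, and the population size are all hypothetical values chosen for illustration, and the score uses the true effect sizes rather than a fitted logistic regression.

```python
import numpy as np

def simulate_auc(n_variants, n_people=10_000, prevalence=0.25,
                 odds_ratio=1.2, baseline_risk=0.05, seed=0):
    """Simulate carriers of independent risk variants and return the AUC of the true risk score."""
    rng = np.random.default_rng(seed)
    # Each person carries each variant independently with the given prevalence.
    carriers = rng.random((n_people, n_variants)) < prevalence
    beta = np.log(odds_ratio)  # same log odds ratio for every variant (assumption)
    # Center the score so a person with the average carrier count has the baseline risk.
    score = beta * (carriers.sum(axis=1) - n_variants * prevalence)
    logit = np.log(baseline_risk / (1.0 - baseline_risk)) + score
    risk = 1.0 / (1.0 + np.exp(-logit))
    case = rng.random(n_people) < risk
    cases, controls = score[case], score[~case]
    # AUC = P(case score > control score) + 0.5 * P(tie), over all case-control pairs.
    greater = (cases[:, None] > controls[None, :]).mean()
    ties = (cases[:, None] == controls[None, :]).mean()
    return greater + 0.5 * ties

auc_few = simulate_auc(10)
auc_many = simulate_auc(150)
print(f"AUC with 10 variants:  {auc_few:.3f}")
print(f"AUC with 150 variants: {auc_many:.3f}")
```

Under these assumed parameters, discrimination should be noticeably better with 150 variants than with 10, which matches the qualitative claim in the abstract; the exact values depend entirely on the chosen inputs.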
Soft decision trees
grantor: University of Toronto
Soft Decision Trees (SDTs) are a new class of semi-parametric methods for classification and regression. They attempt to retain the features that made tree-like techniques widely popular (interpretability, graphical summary of the result, automatic variable selection and interaction detection, etc.) while improving their predictive performance and making the model more believable. This is done by employing "soft", or stochastic, splits, which result in blurred partition boundaries and a continuous prediction surface. The parameters are fitted via Maximum Likelihood, using the EM algorithm. Simulation experiments indicate that SDTs are indeed more powerful predictors. Real data analysis shows that SDTs can also aid in interpretation.
M.Sc.
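To make the idea of a "soft" split concrete, here is a minimal sketch of a single split whose gate is a logistic function of the distance to the threshold. The logistic gate, the `softness` parameter, and the function name `soft_split_predict` are assumptions for illustration; the thesis may define its stochastic splits differently, and the EM fitting step is not shown.

```python
import math

def soft_split_predict(x, threshold, softness, left_value, right_value):
    """Blend two leaf values with a logistic gate instead of a hard x > threshold test."""
    z = (x - threshold) / softness
    # Numerically stable logistic function: sigmoid(z) = 0.5 * (1 + tanh(z / 2)).
    p_right = 0.5 * (1.0 + math.tanh(z / 2.0))
    # The prediction is the gate-weighted average of the two leaves.
    return (1.0 - p_right) * left_value + p_right * right_value

# Hypothetical leaf values 1.0 (left) and 3.0 (right), threshold at 0.
for x in (-5.0, -0.5, 0.0, 0.5, 5.0):
    print(x, round(soft_split_predict(x, 0.0, 1.0, 1.0, 3.0), 3))
```

Because the gate changes smoothly with x, the prediction moves continuously from the left leaf value to the right one, rather than jumping at the threshold as in a conventional tree; a smaller `softness` makes the split behave more like a hard one.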
Statistical analysis of medical images with applications to neuroimaging
grantor: University of Toronto
We extend a classical multivariate technique, Linear Discriminant Analysis (LDA), and apply it to the analysis of PET and fMRI images of human brain function to discover regions of activation driven by the experimental stimuli. We re-examine and specialize some equivalences between LDA and both Canonical Correlation Analysis (CCA) and Multivariate ANOVA (MANOVA). Furthermore, efficient algorithms are derived to facilitate applying these multivariate models to extremely large image data. We deal with the ill-posed nature of the problem using spatial basis expansion and penalization (via the Penalized Discriminant Analysis (PDA) of Hastie et al. (1995)), and utilize efficient measures of predictive performance to optimize hyperparameters and validate the models in a robust fashion. We examine expanding the images into 3D tensor-product B-spline and wavelet bases and compare to the results obtained without expansion. Some parallels between our proposal and some of those currently popular in the neuroimaging community are discussed. Another extension to PDA is derived and applied that allows one to model time series effects that exist in fMRI images. We conclude with many possible enhancements to the proposed paradigm.
Ph.D.
Reduced-Rank Multivariate Model for Time-Course Microarray Data
Abstract: In this paper we present a novel, multi-gene approach to time course microarray experiments. One of the advantages of our approach is an explicit modeling of correlation structure among gene expression data. The approach proposed is computationally attractive. We apply the model to the well-known cell-cycle yeast microarray data and present results that compare favorably to the results of the previous studies